Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

Merged
merged 6 commits into from
Nov 11, 2024

Conversation

paulirwin
Copy link
Contributor

@paulirwin paulirwin commented Oct 25, 2024

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Migrate ANTLRv3 to v4, and automate lexer/parser generation.

Fixes #977

Description

The ANTLRv3 NuGet package we depend on has not been maintained, and targets .NET Standard 1.6 which is reporting some vulnerabilities. Additionally, our current codebase has what appears to be hand-ported lexer and parser code from the original Java, which in some cases has been updated manually as well. The original ANTLRv3 grammar is not in our repo. Ideally, we would generate the lexer and parser using ANTLR from the grammar. On top of this, we've had to change some assertions from the original Java due to discrepancies in the implementation.

This adds the 4.8 grammar to the repo, but updates its syntax to use the v4 format. Notably, ANTLRv4 removes the old v3 AST generation and instead lets you walk the parse tree however you see fit. This means that things like "root nodes" (indicated by !) in the old grammar are no longer necessary or supported, and thus require a different approach to walking the syntax tree. Additionally, empty lexer tokens like AT_CALL are no longer supported, so that use was changed into a new call rule instead, with corresponding LUCENENET-specific callout. Error handling is done another way now, resulting in not needing to add inline C# to the grammar. Other than these changes, the grammar is highly similar to the upstream 4.8 v3 grammar. I decided to stick with the 4.8 grammar instead of updating to the most recent grammar which is already in v4 format to keep it as close as possible to the 4.8 code.

The added Antlr4BuildTasks NuGet package (MIT-licensed) generates the lexer and parser code at compile time now, which reduces manual error and shrinks the size of the code we have to maintain. By removing the lexer and parser from the codebase, this PR results in a net reduction of about 3k lines of code. This also is configured to generate a Listener base class, which is the v4 approach that is closest to the existing 4.8 code (as opposed to a Visitor that returns nodes, as the latest Lucene code uses).

The JavascriptCompiler C# code now looks a bit different as a result of implementing the visitor, but this seemed like a cleaner and more maintainable approach than recursively walking every parse tree child. The Context classes that are generated for each rule are now strongly-typed, so you get the benefit of not having to compare rule/token indices. Another goal of this PR is to upgrade this dependency and change the approach in code without changing any existing tests or public API surface. As a result, the listener implementation is a private nested class instead of public, and this also helps keep the logic mostly in line with where it was before (just a little out of order). Note that we are not using the ParseTreeWalker because we want to control the order of entering nodes, so this might look a little different than other implementations of the Listener in ANTLR.

This also fixes some assertions that were changed from the original Java code, because this implementation produces the same results as the Java Lucene 4.8 code. Notably this is due to casting operands to longs before performing shift and bitwise operations, whereas the previous code was erroneously converting all numbers to int/long if there were any shifts or bitwise operations in the source text. This worked fine for some unit test examples, but could quickly break down for anything more complex than that. All numbers are treated as System.Double except for when shifts or bitwise operations are involved, in which case they are casted to Int64/I8 which truncates any decimal place values.

@paulirwin paulirwin added this to the 4.8.0-beta00018 milestone Oct 25, 2024
@paulirwin paulirwin marked this pull request as ready for review October 27, 2024 20:33
@paulirwin
Copy link
Contributor Author

@NightOwl888 Rebased off latest master to resolve a conflict, ready for review.

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

This seems to work as good as it did previously, so nice work. The only thing we are missing is the attribution in the LICENSE.txt file, which we need to do each time a file or group of files that is under a different license is added.

src/Lucene.Net.Expressions/JS/Javascript.g4 Show resolved Hide resolved
@NightOwl888 NightOwl888 added the notes:improvement An enhancement to an existing feature label Nov 10, 2024
…script.g4. Normalized dividers between different license attributions.
@NightOwl888 NightOwl888 self-requested a review November 11, 2024 09:39
@paulirwin paulirwin merged commit e05e0ff into apache:master Nov 11, 2024
199 checks passed
@paulirwin paulirwin deleted the issue/977 branch November 11, 2024 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notes:improvement An enhancement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate to ANTLR v4 with automated lexer/parser generation
2 participants